In this tutorial, we will learn commands to work with and transform our data: subsetting data based on row values, arranging data within columns, summarizing our data, converting rows into columns (or vice versa), renaming variables, and recoding variables.
Our purpose here is to build on the tidy data ideas to build a better dataset. Doing so eases the transparency of our analyses, increases the clarity of our work to external audiences, and improves our ability to learn from data. This is particularly important because often imported data (including the data we have been using) is arranged poorly, named poorly, and needs to be significant attended to so that everything represents what it is supposed to or what is best for your analysis. Because these are so central to analysis, we have already encountered some of the functions in this section. In this tutorial, we will expand our knowledge of those functions, and learn a host of additional ones, all with the aim of improving our data.
We will learn the following R Commands:
filter()arrange()slice()rename()recode()mutate()na_if()replace_na()case_when()We will use the following datasets:
The first important function to learn is select(). This function allows us to pull a particular variable or set of variables from our dataset. For instance, we might want to use this with the very large anes dataset if we require only a few variables.
The basic format is: select(data, variables). Consider if we only wanted to look at the years 1990-1995 from the electricity dataset. We could narrow our dataset by writing:
# select 1990 - 1995 and country name from electric datasets
select(electric, `Country Name`, `Country Code`, `1991`:`1995`)
The first entry to the function is the name of the dataset, and each of the following is a variable name. If you are selecting between a range of variables, you can use :. Note, however, that this is within the range from the perspective of the columns of the dataset, not by temporal ordering.
A similar important function to learn is filter(). This is essentially select() for rows instead of columns. For example, with the electric data, we can look at specific countries instead of specific years.
The most basic format of this function is: filter(data, column == "value"). Below, we select Bangladesh (a row value in the Country Name column) from electric.
#Filter Bangladesh from electric
filter(electric, `Country Name` == "Bangladesh")
This returns the row containing Bangladesh’s access to electricity every year.
Take notice of the single quotes around the column name (as we had in the previous tutorial with summarize()) and the double quotes around the value name. If you are not consistent about this R grammar, you will run into errors.
Below, try using filter() to look at Mexico in electric.
#Filter Mexico from electric
#Filter Mexico from electric
filter(electric, `Country Name` == "Mexico")
In the first example, we used the logical operator == to choose the row in which Country Name was equal to Bangladesh. We first saw == in Tutorial 1 about R Basics; you may remember that there are other logical operators. As a refresher, those are:
== equals< less than> greater than<= less than or equal to>= greater than or equal toAll of these logical operators can be used in filter(). For example, below we select the rows in which access to electricity in 2016 was at least 95%.
#Filter the rows in 2016 that are greater than or equal to 95
filter(electric, `2016` >= 95)
In this example, 95 needed no quotes because it is a number. R instinctively knows to treat numbers as numeric values, which is the exact reason why the column 2016 still needs single quotes around it. In that case, we want R to treat the number 2016 nominally, not numerically.
Below, find the schools from the pseo dataset where graduates in the 75th percentile of earnings (p75_earnings) make more than $250,000/year.
#Filter pseo using p75_earnings
#Filter pseo using p75_earnings
filter(pseo, p75_earnings > 250000)
There are even more logical operators than the basic greater than/less than statements we learned in the first tutorial. Now that we can use them in filter(), they are useful to know:
& AND| OR! NOTDetails on using these and other logical operators can be found here. For those with basic computer science experience, using AND, OR, and NOT should be very intuitive. If not, here are some examples.
To find countries where electricity access has improved a lot between 1990 and 2016, we can select rows in which access in 1990 was below 20% AND access in 2016 was at least 80%.
#Filter the rows that are below 20 in 1990 AND above 80 in 2016
filter(electric, `1990` < 20 & `2016` > 80)
In this example, & is used to make sure that both conditions (1990 below 20, 2016 above 80) are true. OR (|) would be used in the same place as & to return rows in which either one condition or the other is true.
NOT (!) looks a little different because you often just have one condition. In the example below, we find countries where access to electricity is not 100% in 2016.
#Filter the rows that are NOT 100 in 2016
filter(electric, `2016` != 100)
#Another way to filter using !
filter(electric, ! `2016` == 100)
Let’s say you want to find out what countries have decent, but not excellent, levels of access to electricity. Use filter() and logical operators to find the countries in electric in which access was above 70 but below 85 in 2016.
#Filter the rows in electric that are above 70 and below 85 in 2016
#Filter the rows in electric that are above 70 and below 85 in 2016
filter(electric, `2016` > 70 & `2016` < 85)
Now that we have begun looking at rows instead of columns, we can begin transforming. First, we will learn how to rearrange rows.
At the moment, the rows of electric are arranged in the same way as they were imported: ordered alphabetically by Country Code, from ABW (Aruba) to ZWE (Zimbabwe). For someone looking to make meaningful scientific conclusions from this data, that arrangement is not very helpful.
We can use arrange() to reorder rows using the format: arrange(data, column). Let’s say we want to rank the countries in electric from those with the least electricity access to those with the most electricity access. In the example below, electric is rearranged based on the values in the column 2016.
#Arrange electric by 2016
arrange(electric, `2016`)
The output lists all the same countries, but they are ordered based on their 2016 value instead of their name. Like with filter(), single quotes are used to refer to the column 2016.
Looking back at Exercise 3 if needed, select schools with highly paid graduates (p75_earnings over a quarter million), and then arrange that output by earnings. Remember to use the pipe command %>% to take the output from one command and feed it into the next command. (Hint - If you get an error - look back at the earlier tutorial on the pipe command and see if you can figure out where you went wrong!)
#Filter schools with highly paid graduates and arrange in order of p75_earnings
#Filter schools with highly paid graduates and arrange in order of p75_earnings
filter(pseo, p75_earnings > 250000) %>%
arrange(p75_earnings)
You may notice that by default, arrange() lists values in ascending order, from lowest to highest. If you want to rank values in descending order from highest to lowest, you can put the column name inside of desc() as shown in the example below:
#Arrange electric by 2016 (descending)
arrange(electric, desc(`2016`))
Now, the output lists the countries from highest 2016 percentage to lowest 2016 percentage. If you look carefully, however, you will notice another problem… There are a lot of countries with 100% access in 2016. Since they are all basically tied for first place, the top countries are just listed alphabetically. This is not necessarily an issue, but in some situations, you might want a secondary tie-breaker criteria to order rows with the same first criteria from your arrange() function. We address this issue in the following section.
Looking back at Exercise 3 and 4 if needed, select schools with highly paid graduates (p75_earnings over a quarter million), and then arrange that output in descending order. Remember to use the pipe command %>% to take the output from one command and feed it into the next command.
#Filter schools with highly paid graduates and arrange in order of p75_earnings
#Filter schools with highly paid graduates and arrange in order of p75_earnings
filter(pseo, p75_earnings > 250000) %>%
arrange(desc(p75_earnings))
You can list multiple columns within arrange() to create such “tie-breakers”. Let’s say you want to differentiate between countries that have jumped to 100% in the past year and countries that have mantained 100%. If you use 2015 as a secondary column in arrange(), the 100% countries in 2016 will still appear at the top of the list, but they will be ordered based on their 2015 value. See the example below.
#Arrange electric by 2016, then 2015 (descending)
arrange(electric, desc(`2016`), desc(`2015`))
Examine the rows and compare them to previous example. The results are only slightly different- but if you want to set up an elaborate sorting or ranking formula, those slight differences are useful to capture.
slice() is an easy-to-use function to accompany arrange(). It “slices” out a certain number of specified rows from data. It is basically a more complex, practial version of head(), the function we used in Tutorial 2 to preview the first five rows of different datasets.
slice() follows the format: slice(data, rows). Below, without any rearranging, we slice the first 10 rows in electric.
#Slice the first 10 values in electric
slice(electric, 1:10)
By using slice() like this after arrange(), we have an easy method for finding the “top 10” or “top x” of any ranked data.
Let’s revisit piping with %>% to combine slice() and arrange(). Below, we rearrange and slice out the 10 countries with the least access to electricity in 2016.
#Slice the 10 lowest-access values in electric
electric %>%
arrange(`2016`) %>%
slice(1:10)
After arranging the data based on 2016, the top 10 rows are then sliced out. Why is this more useful than just looking at the first 10 values in the ascending list we arranged earlier? This method returns a 10-row subset that we can examine, run statistics on, or graph independently from the full data.
Use arrange() and slice() to find the top 10 highest-earning rows in pseo based on the column p50_earnings.
#Arrange pseo based on p50_earnings, then slice top 10 values
#Arrange pseo based on p50_earnings, then slice top 10 values
pseo %>%
arrange(desc(p50_earnings)) %>%
slice(1:10)
slice() can be particularly useful in conjunction with group_by(). Suppose we only want to know the 75% percentile salary for the highest earning alumni group from each college or university. In the pseo dataset, each institution appears multiple times based on different alumni groups and graduation cohorts. We can use group_by(), as introduced in tutorial 3, in conjunction with other commands to find the top values for different schools. We can then use slice to return the name of all institutions only ONCE. In the example below, we arrange pseo by p75_earnings, group by institution_name, then slice by 1. The result contains the highest-earning row for each institution in the dataset.
#Arrange pseo based on p75_earnings
#select the columns institution_name and p75_earnings
#then slice top value
pseo %>%
arrange(desc(p75_earnings)) %>%
select(institution_name,p75_earnings)%>%
group_by(institution_name) %>%
slice(1)
Even though you slice by 1, the resulting output contains 47 rows: one for each group group/institution. Group_by is extremely versatile and can be used with many different functions.
One issue is that we still are not getting the top 10 values in order - the returned data appear to be in alphabetical order. This is because the data are grouped by institution name, and we need to use ungroup() before we can sort by the p75_earnings information. Try using ungroup() after slice - what happens? Can you return the top 10 values of p75_earnings after using ungroup().
#Slice the top value of p75_earnings for each instution
#Return the columns institution_name and p75_earnings sorted by p75_earnings
#Slice the top value of p75_earnings for each instution
#Return the columns institution_name and p75_earnings sorted by p75_earnings
pseo %>%
select(institution_name,p75_earnings)%>%
group_by(institution_name) %>%
arrange(desc(p75_earnings)) %>%
slice(1)%>%
ungroup()%>%
arrange(desc(p75_earnings))%>%
slice(1:10)
It’s very easy to rename columns in a dataset with the function rename().
Let’s look back at the anes datatset. As you can see below, the data is currently unintelligible because the column names are all strings of numbers and letters, while the row values are all numeric.
head(anes)
All of these numbers represent survey data from thousands of respondents. The columns represent different questions. We can find out what they mean by looking up the codebook for this data set. A codebook is a document that explains what data represents; almost all survey datasets have an accompanying codebook.
ACCESSING THE CODEBOOK: Clicking the link above brings you to an index page. To access the codebook for the variables available in this tutorial, click Sequential Variable List on the left sidebar, then scroll down and click on POST-ELECTION QUESTIONS. The list of questions there contains the data used in this tutorial.
For example, the first column, v162002 represents the question, “How many programs about 2016 campaign did Respondent watch on TV?” (1 = None, 2 = Just one or two, 3 = Several, 4 = A good many).
We can rename this column “Media_Exposure” with the function rename(), which follows the format: rename(data, new_column_name = old_column_name).
#Rename v162002 as Media_Exposure
rename(anes, Media_Exposure = v162002)
It appears successful… but look what happens if we try to preview anes again…
#Preview to check
head(anes)
The renamed column did not stay that way. The rename() function just outputs data with new column names; in order to save the data that way, we need to use <- to assign the new output onto the old data. In the example below, we accomplish the same renaming as before, but this time it is permanent.
#Rename v162002 as Media_Exposure
anes<-rename(anes, Media_Exposure = v162002)
#Preview to check
head(anes)
Using <- before recode() is basically telling R: “rename the column, and keep it that way.”
You can also rename multiple column names at once using commas. Below, the next three columns are renamed according to their definition in the codebook. The method below uses piping, but it is the same function as before.
#Rename v162003, v162004, and v162005 as Radio_Exposurre, Internet_Exposure and Newspaper_Exposure
anes<-anes%>%
rename(Radio_Exposure = v162003,
Internet_Exposure = v162004,
Newspaper_Exposure = v162005)
#Preview to check
head(anes)
In the piping above, we did not need an %>% at the end of every line because those lines were contained within the rename() function. You only need %>% in between different functions, not different arguments of the same function, when piping.
Use the codebook and rename() to appropriately rename the next column, v162006.
#Rename v162006 according to the codebook
#Rename v162006 according to the codebook
anes<-rename(anes, Website_Exposure = v162006)
Renaming confusing column names was easy. The substantive part of recoding data is changing numeric row values into nominal values that different numbers represent.
We will need to two functions to accomplish this in large datasets, recode() and mutate(). First, let’s learn the basics of recode() using very simple example data.
Below, we create a vector containing 1’s, 2’s and 3’s called fruit. We can use recode() to change this vector so that 1 represents apple, 2 represents banana, and 3 represents strawberry. As you can see, recode() follows the format: recode(data, oldvalue1 = newvalue1, oldvalue2 = newvalue2, etc). As in prior tutorials, single quotes are used around the numeric values and double quotes are around the nominal values.
#Create fruit vector
fruit <- c(1, 2, 1, 3, 2, 1, 1, 3, 2)
#Recode fruit values
recode(fruit, `1` = "apple", `2` = "banana", `3` = "strawberry")
## [1] "apple" "banana" "apple" "strawberry" "banana"
## [6] "apple" "apple" "strawberry" "banana"
Below, try recoding fruit so that 1 represents lemon, 2 represents lime, and 3 represents orange.
#Recode fruit into: lemon, lime, orange
#Recode fruit into: lemon, lime, orange
recode(fruit, `1` = "lemon", `2` = "lime", `3` = "orange")
Unfortunately, when working with dataframes instead of simple vectors, we will need more than just recode() to recode entire variables. There are many methods of doing so, but the most efficient way is to use mutate(), a function that creates new columns in an existing dataframe. When recoding variables with mutate(), we will not actually change the numbers within a column; rather, we can create a new column that contains recoded values.
Let’s go back to the anes data to learn how this is done. Specifically, let’s look at the column v162034a, which if you look at the codebook, stands for the survey question: “For whom did Respondent vote for President?”. In the example below, we use mutate(), following the format: mutate(data, newcolumn = columndata), to create a new column called Vote_16. For the column data, we can imbed recode() into this function and recode data from v162034a in the way we learned before. And, like earlier in the tutorial, we will use <- to save the data this way.
#Recode v162034a according to the codebook:
anes<-mutate(anes, Vote_16 = recode(v162034a, `1` = "Clinton", `2` = "Trump", `3` = "Johnson", `4` = "Stein"))
## Warning: Unreplaced values treated as NA as .x is not compatible. Please specify
## replacements exhaustively or supply .default
#Preview Vote_16 vs. original
head(select(anes, Vote_16))
head(select(anes, v162034a))
Just for reference, shown below is the slighly different format we would use to accomplish the same recoding using piping.
#Recode v162034a according to the codebook (with piping)
anes<-anes%>%
mutate(Vote_16 = recode(v162034a, `1` = "Clinton", `2` = "Trump", `3` = "Johnson", `4` = "Stein"))
## Warning: Unreplaced values treated as NA as .x is not compatible. Please specify
## replacements exhaustively or supply .default
In any case, we now have a column called Vote_16 which contains all the voting data from v162034a without any of the confusing numbers and codes. We can use this new column as we would use other columns to get straightforward statistics.
#Make a proportional table for Vote_16
prop.table(table(select(anes, Vote_16)))
##
## Clinton Johnson Stein Trump
## 0.49349656 0.04399388 0.01185922 0.45065034
If we see a warning about NAs and you would like to know how many cases have an NA code (perhaps to compare to codebook information about which cases should have missing values for one of the provided reasons), we can include NAs in the table command as follows.
#Make a proportional table for Vote_16
table(select(anes, Vote_16),useNA = "ifany")
##
## Clinton Johnson Stein Trump <NA>
## 1290 115 31 1178 1657
Using mutate() and recode(), recode the column v162132 according to the codebook. You will have to refer to the codebook to see what different values of this variable represent. Check your work by examining your new column of recoded data with select() or table().
#Recode v162132 according to the codebook
#Recode v162132 according to the codebook- then check work
anes<-anes%>%
mutate(Class = recode(v162132, `1` = "Lower", `2` = "Working", `3` = "Middle", `4` = "Upper"))
#Check work
table(select(anes, Class))
One nice thing about using mutate is that you can recode multiple variables in a single statement. In the example below, we use mutate() to recode two more variables, v162034, “Did Respondent Vote for President?”, and v162035, “Preference strong for Pres candidate for whom Respondent voted?”.
#Recode v162034 and v162035 according to the codebook
anes<-anes%>%
mutate(Did_Vote = recode(v162034, `1` = "Yes", `2` = "No"),
Vote_Preference = recode(v162035, `1` = "Strong", `2` = "Not strong")
)
## Warning: Unreplaced values treated as NA as .x is not compatible. Please specify
## replacements exhaustively or supply .default
## Warning: Unreplaced values treated as NA as .x is not compatible. Please specify
## replacements exhaustively or supply .default
#Preview results
prop.table(table(select(anes, Did_Vote)))
##
## No Yes
## 0.01428571 0.98571429
prop.table(table(select(anes, Vote_Preference)))
##
## Not strong Strong
## 0.2921053 0.7078947
Use the format above if you want to recode multiple variables at once in the future.
So far in these tutorials, we have largely ignored NA values. NA stands for missing data. Many datasets understandably come with missing data, since it’s not always possible to collect all the available data when researching. Although NA’s might mean that your data is incomplete, they are also very easy to ignore when making calculations. In this way, an NA can be more helpful than an irrelevant value.
There are two easy functions we can use to work with missing data: replace_na() and na_if().
First: replace_na() is used to replace all NA cases with a specified value. It follows the basic format, replace_na(data, replacement) In the example below, we create another fruit vector, but this time it contains NA values. Then, those NA values are replaced with “Unknown”.
#Create fruit vector
fruit <- c(1, 2, NA, 1, 3, 2, NA, NA, 1, 1, 3, 2)
#Replace NA's
replace_na(fruit, "Unknown")
## [1] "1" "2" "Unknown" "1" "3" "2" "Unknown"
## [8] "Unknown" "1" "1" "3" "2"
When working with data frames, you need to slightly modify this format in order to specify what column to replace NA’s within. This format is: replace_na(data, list(column = replacement)). Below, using the pseo data, we replace NA in the p25_earnings column with “Not Available”.
replace_na(pseo, list(p25_earnings = "Not Available"))
You can list multiple columns with unique replacements, if you want.
replace_na(pseo, list(p25_earnings = "Not Available", p50_earnings = "No Data", p75_earnings = "Unavailable"))
Whether you are making these changes for aesthetics or to help with your analysis, they are easy to accomplish.
Na_if() is basically the opposite of replace_na(). It converts specified values into NA. This is helpful if you want to ignore irrelevant values of a dataset in your calculations.
It follows the format: na_if(data, value). Pretty simple. Below, we take another fruit vector, but this time we replace 9’s with NA.
#Create fruit vector
fruit <- c(1, 2, 9, 1, 3, 2, 9, 9, 1, 1, 3, 2)
#Replace 9's
na_if(fruit, 9)
## [1] 1 2 NA 1 3 2 NA NA 1 1 3 2
Let’s go back to the anes survey data. For every question, there are responses that are not necessarily relevant for analysis. For example, in the presidential vote question, v162034a, the codebook tells us that -1 means “Innappropriate”, meaning the respondent was not asked this question because of other answers (for example, if they previously answered that they did not vote for president, they were not asked who they voted for). We can use na_if() to replace -1 with NA. Below, we do this using mutate() in order to create a new column with this modification.
#Replace -1 in v162034a with NA (as a new column)
anes <- anes %>%
mutate(Vote_withNA = na_if(v162034a, -1))
#Preview Vote_withNA vs. original
head(select(anes, Vote_withNA))
head(select(anes, v162034a))
This is a case where having NA’s is better than having specified answers with no significance for analysis.
The column v162152b stands for the question: “Does Respondent favor or oppose limits on foreign imports?” 1 = “Favor”, 2 = “Oppose”, and 99 = “Haven’t thought much about this”. Recode this column so that “Haven’t thought much about this” values are converted to NA. Then check your work.
#Replace 99 in v162152b with NA
#Replace 99 in v162152b with NA
anes <- anes %>%
mutate(Imports = na_if(v162152b, 99))
#Check your work
prop.table(table(select(anes, v162152b)))
prop.table(table(select(anes, Imports)))
Case_when is a more complex, but more powerful, alternative to recode() that can be used to recode variables. In basic data analysis and recoding, you will likely be able to get by with only recode(), but being introduced to case_when() may prove useful for bigger projects.
In the example below, case_when() is used to recode v162034a, the presidential vote question, exactly as it was done earlier with recode(). As you can see, this function still takes place within mutate().
#Recode v162034a using case_when()
anes<-anes%>%
mutate(Vote_16 = case_when(
v162034a == 1 ~ "Clinton",
v162034a == 2 ~ "Trump",
v162034a == 3 ~ "Johnson",
v162034a == 4 ~ "Stein")
)
table(select(anes, Vote_16))
##
## Clinton Johnson Stein Trump
## 1290 115 31 1178
Case_when() uses a series of arguments that are separated by commas and uses ~ as a sort of equals sign. Think of the ~ as meaning “becomes” or “turns into”. For example, in the first line of the function in the example above, v162034a == 1 ~ "Clinton" means: “In rows where v162034a equals 1, values in this new column become Clinton”.
Unlike recode(), which directly replaces column values with new column values, case_when() creates values based on individually constructed arguments. This is useful because we can create any kind of arguments, such as with logical operators like > or !. It is also useful because these arguments can refer to any variables, whereas recode() is constricted to a single variable.
In the example below, case_when() is used to create a new variable Vote_Strength constructed from both the presidential vote variable, v162034a (1 = Clinton, 2 = Trump), and the vote strength variable, v162035 (1 = Strong Preference, 2 = Weak Preference). This is something that can only be achieved with case_when().
#Recode v162034a using case_when()
anes<-anes%>%
mutate(Vote_Strength = case_when(
v162034a == 1 & v162035 == 1 ~ "Strong Clinton Vote",
v162034a == 1 & v162035 == 2 ~ "Weak Clinton Vote",
v162034a == 2 & v162035 == 1 ~ "Strong Trump Vote",
v162034a == 2 & v162035 == 2 ~ "Weak Trump Vote")
)
#Check your work
table(select(anes, Vote_Strength))
##
## Strong Clinton Vote Strong Trump Vote Weak Clinton Vote Weak Trump Vote
## 939 858 349 319
This opens the door for much more advanced recoding of variables.
Here’s one last technical tip: We can create a final argument within case_when() that applies to all leftover rows which have not yet been modified. This final argument follows the format: TRUE ~ value. If this final argument is not specified, the leftover rows are set to equal NA by default.
In the example below, we recode v162034a and v162035 together just like above, but this time, we use a final argument in which we set all other row values to “Not Available”.
#Recode v162034a using case_when()
anes<-anes%>%
mutate(Vote_Strength = case_when(
v162034a == 1 & v162035 == 1 ~ "Strong Clinton Vote",
v162034a == 1 & v162035 == 2 ~ "Weak Clinton Vote",
v162034a == 2 & v162035 == 1 ~ "Strong Trump Vote",
v162034a == 2 & v162035 == 2 ~ "Weak Trump Vote",
TRUE ~ "Not Available")
)
#Check your work
table(select(anes, Vote_Strength))
##
## Not Available Strong Clinton Vote Strong Trump Vote Weak Clinton Vote
## 1806 939 858 349
## Weak Trump Vote
## 319
This final argument can help classify various leftover data, or it can help illustrate how many cases are being ignored by overly specific case_when() arguments that precede.
You now have most of the tools for recoding variables. In the last section of this tutorial, we are going to put those to the test to practially recode different data and convert data from one type to another.
Using mutate() and case_when(), recode the column v162134 according to the codebook. You will have to refer to the codebook to see what different values of this variable represent. Check your work by examining your new column of recoded data with select() or table().
#Recode v162134 using case_when() according to the codebook
#Recode v162134 using case_when() according to the codebook
anes<-anes%>%
mutate(Opportunity = case_when(
v162134 == 1 ~ "A great deal",
v162134 == 2 ~ "A lot",
v162134 == 3 ~ "A moderate amount",
v162134 == 4 ~ "A little",
v162134 == 5 ~ "None",
TRUE ~ "Not Available"
)
)
#Check work
table(select(anes, Opportunity))
In class, we learned about the difference between nominal, ordinal, and interval data. These are different levels of measurement. In R, data is most often measured as either numeric, categorical, or binary. In R, we are able to recode data to change it from one measurement level to another. It’s difficult to make data more specific (ex. categorical to numeric, binary to categorical), but it’s easy to simplify numeric data to categorical data, numeric data to binary data, or categorical data to binary data.
This will make more sense when we look at examples. The following examples use a “Feeling Thermometer” question from the anes survey. In these questions, respondents chose a number between 0 and 100 to show their feeling towards a particular subject (0 being bad, 100 being good). The question we will be looking at, v162110, measures respondents’ feelings toward police from a 0-100 scale.
Reducing measurement level can make data easier to understand. Saying that respondents rate their feelings toward police as a 64.4 on average is only so helpful. In the examples below, we recode and categorize this numeric variable in different ways.
In the example below, we use mutate() and case_when() to split the numeric feeling thermometer question about police attitudes into six categories: Like Strongly (90+), Like (65-90), Like Slightly (50-65), Dislike Slightly (35-50), Dislike (10-35), and Dislike Strongly (below 10). Notice how >= (greater than or equal to) and < (less than) are used with the logical operator & to make categories without overlap. These categories are factors in a new column, Police.
At the end, table() is used to preview these new categories (and to check that it worked).
#Recode v162034a using case_when(), numeric -> categorical
anes<-anes%>%
mutate(Police = case_when(
v162110 >= 90 ~ "Like Strongly",
v162110 >= 65 & v162110 < 90 ~ "Like",
v162110 >= 50 & v162110 < 65 ~ "Like Slightly",
v162110 >= 35 & v162110 < 50 ~ "Dislike Slightly",
v162110 >= 10 & v162110 < 35 ~ "Dislike",
v162110 < 10 ~ "Dislike Strongly")
)
#Check your work
table(select(anes, Police))
##
## Dislike Dislike Slightly Dislike Strongly Like
## 137 204 697 1714
## Like Slightly Like Strongly
## 522 997
As you can see, this breakdown is much more meaningful than simply presenting the mean of the data.
In this example, we use the same functions as before, but we are creating a column called Favor_Police that simply is TRUE if a respondent rates police as above 50 and FALSE if a respondent rates police as below 50. You may want very simple logical data as an alternative to creating many categories.
#Recode v162034a using case_when(), numeric -> binary
anes<-anes%>%
mutate(Favor_Police = case_when(
v162110 >= 50 ~ TRUE,
v162110 < 50 ~ FALSE)
)
#Check your work
table(select(anes, Favor_Police))
##
## FALSE TRUE
## 1038 3233
If you have categorical data to start with, you may still want to turn it into true/false format. This last example is a synthesis of the previous two. We are again coding binary data in Favor_Police, but it is being created out of the six categories we made first. When Police is equal to any of the favorable categories (using the logical operator |), Favor_Police is set to TRUE, and when Police is equal to any of the unfavorable categories, Favor_Police is set to FALSE.
#Recode v162034a using case_when(), categorical -> binary
anes<-anes%>%
mutate(Favor_Police = case_when(
Police == "Like Strongly" | Police == "Like" | Police == "Like Slightly" ~ TRUE,
Police == "Dislike Strongly" | Police == "Dislike" | Police == "Dislike Slightly" ~ FALSE)
)
#Check your work
table(select(anes, Favor_Police))
##
## FALSE TRUE
## 1038 3233
This exercise has three parts, following the format of the three examples above. In these exercises, you will be reducing the measurement level of the 2016 column of electric. Like the question in anes about police, the values of electric range from 0 to 100. Although these numbers are about percentages of population, not “feeling thermometer” answers, the recoding will work the same way.
Create a new column named Condition to reduce the numeric 2016 column into four categories:
Create a new column named Good_Condition to reduce the numeric 2016 column into two states:
TRUE, when access is greater than or equal to 80%FALSE, when access is less than 80%Recreate the column Good_Condition by reducing the categorical Condition into two states:
TRUE, when access is Full or GoodFALSE, when access is Okay or Poor#Recode 2016 using case_when(), numeric -> categorical
#Check your work
#Recode 2016 using case_when(), numeric -> binary
#Check your work
#Recode 2016 using case_when(), categorical -> binary
#Check your work
#Recode 2016 using case_when(), numeric -> categorical
electric<-electric%>%
mutate(Condition = case_when(
`2016` >= 99 ~ "Full",
`2016` >= 85 & `2016` < 99 ~ "Good",
`2016` >= 60 & `2016` < 85 ~ "Okay",
`2016` < 60 ~ "Poor")
)
#Check your work
table(select(electric, Condition))
#Recode 2016 using case_when(), numeric -> binary
electric<-electric%>%
mutate(Good_Condition = case_when(
`2016` >= 80 ~ TRUE,
`2016` < 80 ~ FALSE)
)
#Check your work
table(select(electric, Good_Condition))
#Recode v162034a using case_when(), categorical -> binary
electric<-electric%>%
mutate(Good_Condition = case_when(
Condition == "Full" | Condition == "Good" ~ TRUE,
Condition == "Okay" | Condition == "Poor" ~ FALSE)
)
#Check your work
table(select(electric, Good_Condition))
You now know the basic commands for subsetting, rearranging, and reshaping data. This is a crucial set of skills for working with messy datasets and preparing them for analysis. You can also rename and recode variables, creating better and more transparent datasets for our analyses.
Coming up in the next tutorial…with our data prepared, we will learn some approaches for describing and understanding our data.
Here’s a recap of the functions you have encountered so far:
Tutorial 1: R Basics and Assigning Variables
+,-,*,/==,>,<,>=,>=<-c()Tutorial 2: Reading and Describing Data
head()dim()colnames()select()
starts_with()ends_with()contains()table()prop.table()gather()spread()Tutorial 3: Intro to Visualization
xtabs()ggplot()
geom_histogram()geom_bar()geom_point()geom_boxplot()geom_violin()geom_smooth()Tutorial 4: Advanced Visualization
xtabs()ggplot()
facet_wrap()aes(fill =)Tutorial 5: Transforming Data
filter()
&, |, !arrange()slice()rename()recode()mutate()na_if()replace_na()case_when()